Support Linear State in SDPA Pipeline#3359

Open
apaniukov wants to merge 31 commits into openvinotoolkit:master from apaniukov:lfm2-stateful-model

Conversation

@apaniukov
Contributor

@apaniukov apaniukov commented Feb 19, 2026

Description

Support fixed-size cache state for linear/hybrid attention models.

Core abstraction — CacheTypes and CacheState (utils.hpp, utils.cpp)

  • New CacheTypes class with a bitmask tracking has_kvcache() and has_linear().
  • New CacheState class replacing the former KVCacheState, carrying cache type information, trim/reset flags, and the token-history mirror.
  • get_cache_types(const ov::Model&) detects cache kinds from ReadValue node shapes: 4D dynamic = KV-cache, 3D dynamic = linear (SSM) state.
  • trim_kv_cache() updated to handle hybrid models: resets linear caches on trim (full reset), trims only KV-cache tensors for attention.

Stateful LLM pipeline (pipeline_stateful.cpp, lm_encoding.cpp)

  • m_cache_state (formerly KV-only) replaced with CacheState constructed from the model on init.
  • align_kv_cache_and_history() sets reset_mem_state = true explicitly when state is empty (preserving first-call reset behavior, now decoupled from needs_reset()).

VLM pipeline (pipeline.cpp, inputs_embedder.cpp)

  • CacheState propagated to VLM's language-model path; cache types set from the language model on construction.

Speculative decoding (fast_draft_strategy.cpp, eagle3_strategy.cpp)

  • Both LLMInferWrapper and Eagle3InferWrapperBase now detect cache types from the model and store m_cache_types.

Tests

  • New test_cache_types.cpp: CSV-driven parameterized GTests for get_cache_types() against real converted OV models.
  • data/cache_types_models.csv: 3 model entries (Phi3=kvcache-only, LFM2=hybrid, Mamba=linear-only).
  • run_cache_types_tests.sh: local helper script to convert models and run tests.
  • CMakeLists.txt: installs data/ alongside the test binary.

CI

  • New "Convert models for cache types gtests" step converts models from HuggingFace (with caching).
  • TEST_MODELS_BASE_DIR and CACHE_TYPES_CSV env vars passed to the existing gtests step.

CVS-181414

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings February 19, 2026 12:36
@apaniukov apaniukov changed the title Support Linear State in SDPA Pipeline [WiP] Support Linear State in SDPA Pipeline Feb 19, 2026
@github-actions github-actions bot added the labels category: visual language, category: LLM, category: speculative decoding, and no-match-files on Feb 19, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR generalizes the “KV cache state” tracking to support fixed-size linear (and hybrid) cache state in stateful/SDPA-based pipelines by introducing a unified cache state type and propagating it through LLM/VLM/speculative decoding codepaths.

Changes:

  • Replaced KVCacheState with CacheState across pipelines and embedders.
  • Added cache kind detection (CacheTypes / get_cache_types) and updated cache-trimming behavior to reset for linear caches.
  • Wired cache-kind awareness into speculative decoding wrappers and stateful LLM pipeline initialization.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.

Summary per file:
src/cpp/src/visual_language/vision_token_pruning_processor.hpp Updates pruning processor API to use CacheState.
src/cpp/src/visual_language/vision_token_pruning_processor.cpp Updates pruning processor implementation signature to CacheState.
src/cpp/src/visual_language/pipeline.cpp VLM pipeline now uses CacheState when managing chat history/cache trimming.
src/cpp/src/visual_language/phi4mm/classes.cpp Switches embedder history/cache bookkeeping to m_cache_state.
src/cpp/src/visual_language/phi3_vision/classes.cpp Switches embedder history/cache bookkeeping to m_cache_state.
src/cpp/src/visual_language/inputs_embedder.hpp Replaces stored state from KVCacheState to CacheState.
src/cpp/src/visual_language/inputs_embedder.cpp Updates chat/history alignment and rollback bookkeeping to CacheState.
src/cpp/src/utils.hpp Introduces CacheTypes, CacheState, and get_cache_types() API.
src/cpp/src/utils.cpp Implements cache kind detection and updates trim_kv_cache() behavior for linear caches.
src/cpp/src/speculative_decoding/stateful/fast_draft_strategy.hpp Adds CacheTypes member to infer wrapper.
src/cpp/src/speculative_decoding/stateful/fast_draft_strategy.cpp Initializes CacheTypes and uses it to build CacheState for trimming.
src/cpp/src/speculative_decoding/stateful/eagle3_strategy.hpp Adds CacheTypes member to eagle3 infer wrapper base.
src/cpp/src/speculative_decoding/stateful/eagle3_strategy.cpp Initializes CacheTypes and uses it to build CacheState for trimming.
src/cpp/src/lm_encoding.hpp Updates encoding helpers to accept CacheState.
src/cpp/src/lm_encoding.cpp Updates chat-history alignment logic and cache-state updates for CacheState.
src/cpp/src/llm/pipeline_stateful.hpp Renames stored cache reflection to m_cache_state and renames reset helper.
src/cpp/src/llm/pipeline_stateful.cpp Initializes CacheState from model and propagates it through chat/trim logic.
Comments suppressed due to low confidence (1)

src/cpp/src/utils.cpp:525

  • trim_kv_cache() resets the InferRequest when reset_mem_state is set (or when linear cache needs reset), but it returns without clearing cache_state.reset_mem_state / num_tokens_to_trim or updating the token reflection state. This can leave CacheState inconsistent (stale tokens / repeated resets) for subsequent steps. Consider resetting the CacheState fields when a reset happens (and clearing the token reflection if the underlying model state is cleared).
void trim_kv_cache(ov::InferRequest request, CacheState& cache_state, std::optional<AdapterController> adapter_controller) {
    if (
        cache_state.reset_mem_state
        // linear cache stores only the last state, trimming is not possible, so we reset the whole cache in this case
        || (cache_state.num_tokens_to_trim > 0 && cache_state.has_linear())
    ) {
        if (adapter_controller) {
            for(auto& state: request.query_state()) {
                if(!adapter_controller->has_state_name(state.get_name())) {
                    state.reset();
                }
            }
        } else {
            request.reset_state();
        }

        return;
    }
    // ... (the remainder of the function trims KV-cache tensors only)
}

@as-suvorov
Collaborator

I converted the PR to a draft since it is labeled WIP.

@as-suvorov as-suvorov marked this pull request as draft February 19, 2026 12:54
Copilot AI review requested due to automatic review settings February 20, 2026 13:31
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings February 26, 2026 10:35
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Comment on lines +439 to +455
CacheTypes get_cache_types(std::shared_ptr<const ov::Model> model) {
    // "ReadValue" node is cache representation in stateful model
    const std::string state_node_type_name = std::string(ov::op::v6::ReadValue::get_type_info_static().name);
    CacheTypes cache_types;

    for (const auto op : model->get_ops()) {
        // check input size, as in LoRA adapters case it could be 0
        if (op->get_type_name() != state_node_type_name || op->get_input_size() < 1) {
            continue;
        }

        // Shape example: [-1,4,0,64]
        auto shape = op->get_input_partial_shape(0);
        const auto rank = shape.rank().get_length();
        size_t dynamic_axis_count = 0, zero_axis_count = 0;
        for (size_t i = 0; i < rank; i++) {
            if (shape[i].is_dynamic()) {
Copilot AI Feb 26, 2026

get_cache_types() calls shape.rank().get_length() unconditionally. If a ReadValue input has dynamic rank, get_length() can throw; this would make cache-type detection fail at runtime for some models. Guard with shape.rank().is_dynamic() (skip/continue or handle) before calling get_length(), and similarly avoid iterating dimensions when rank is dynamic.

Copilot uses AI. Check for mistakes.
Comment on lines +439 to +444
CacheTypes get_cache_types(std::shared_ptr<const ov::Model> model) {
    // "ReadValue" node is cache representation in stateful model
    const std::string state_node_type_name = std::string(ov::op::v6::ReadValue::get_type_info_static().name);
    CacheTypes cache_types;

    for (const auto op : model->get_ops()) {
Copilot AI Feb 26, 2026

New cache-type detection (get_cache_types) and the linear-cache reset path in trim_kv_cache() introduce non-trivial behavior that can regress chat/history correctness. There are existing gtests for utils (e.g., tests/cpp/utils.cpp), but no coverage for these new paths; please add unit tests covering KV-only, linear-only, and hybrid detection and verifying reset/trim bookkeeping.

Copilot generated this review using guidance from repository custom instructions.
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.

Comment on lines +670 to +679
// PA backend does not support linear attention states (conv/SSM caches).
if (attention_backend == PA_BACKEND
    && utils::has_linear_attention_states(models_dir, properties)) {
    if (utils::explicitly_requires_paged_attention(user_properties)
        || user_properties.find("ATTENTION_BACKEND") != user_properties.end()) {
        GENAI_WARN("PA backend does not support models with linear attention states. The model may work incorrectly.");
    } else {
        attention_backend = SDPA_BACKEND;
    }
}
Copilot AI Mar 10, 2026

has_linear_attention_states(models_dir, properties) loads the language model to inspect its states, but VLMPipelineImpl(models_dir, ...) will also read/compile the same language model. Consider reusing the already-loaded language_model from VLMPipelineImpl (or reading it once in this constructor and passing it down) to avoid duplicated model reads/parsing at initialization.

Comment on lines +457 to +460
for (const auto op : model.get_ops()) {
    // check input size, as in LoRA adapters case it could be 0
    if (op->get_type_name() != state_node_type_name || op->get_input_size() < 1) {
        continue;
Copilot AI Mar 10, 2026

In get_cache_types(), the loop uses for (const auto op : model.get_ops()), which copies each shared_ptr and bumps the atomic ref-count for every op. Using const auto& op avoids that overhead (and is more consistent with performance-sensitive model graph walks).

Copilot AI review requested due to automatic review settings March 10, 2026 10:28
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated no new comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.



Comment on lines +463 to +468
        // Shape example: [-1,4,0,64]
        auto shape = op->get_input_partial_shape(0);
        const auto rank = shape.rank().get_length();
        size_t dynamic_axis_count = 0, zero_axis_count = 0;
        for (size_t i = 0; i < rank; i++) {
            if (shape[i].is_dynamic()) {
Copilot AI Mar 10, 2026

get_cache_types() calls shape.rank().get_length() without checking shape.rank().is_static(). If a model contains a ReadValue with dynamic rank, this will throw/assert inside OpenVINO. Add a guard (e.g., if (!shape.rank().is_static()) continue;) before using get_length() and iterating the rank.

Comment on lines 503 to +507
        // Shape example: [-1,4,0,64]
        auto shape = op->get_input_partial_shape(0);
        if (shape.rank().get_length() != 4) {
            // kv cache should have 4 dimensions
            continue;
Copilot AI Mar 10, 2026

get_kv_axes_pos() uses shape.rank().get_length() in the != 4 check without verifying that the rank is static. If rank is dynamic, get_length() can throw/assert. Consider checking shape.rank().is_static() first and skipping ReadValue nodes with dynamic rank.

Comment on lines 100 to +101
  // get reflection of tokens contained in the kv cache
- utils::KVCacheState& get_kv_cache_state();
+ utils::CacheState& get_kv_cache_state();
Copilot AI Mar 10, 2026

The comment says this returns a reflection of tokens contained in the KV cache, but the type was changed to utils::CacheState and now covers non-KV cache kinds (e.g., linear/SSM state) as well. Consider updating the comment (and possibly the accessor name) so it matches the new semantics.



bool has_linear_attention_states(const std::filesystem::path& models_path, const ov::AnyMap& properties) {
    return get_cache_types(*read_model(models_path, properties)).has_linear();
}
Collaborator

Not used anymore

Contributor Author

Used in VLM constructor.

Collaborator

Model-based constructors are needed in this case for VLMPipelineImpl and VLMContinuousBatchingAdapter. Model reading is heavy.

Copilot AI review requested due to automatic review settings March 10, 2026 12:44
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings March 10, 2026 18:11
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 1 comment.

from dataclasses import dataclass
from pathlib import Path
from typing import Type
import subprocess
Copilot AI Mar 10, 2026

import subprocess will be flagged by Bandit (B404) in this repo (Bandit runs recursively without excluding tests/). Other test files suppress this with # nosec B404 on the import. Consider adding the same suppression here (or refactoring to reuse the existing helper that already carries the suppression) to avoid CI failures.

Suggested change
- import subprocess
+ import subprocess  # nosec B404



5 participants